Image token removal is an efficient augmentation strategy for reducing the cost of computing image features. However, it has been found to adversely affect the accuracy of CLIP-based training. We hypothesize that removing a large portion of image tokens may discard the semantic content associated with a given text description, thus producing an incorrect pairing target in CLIP training. To address this issue, we propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description. The correlation scores are computed in an online fashion using the EMA version of the visual encoder. Our experiments show that the proposed attentive masking approach performs better than the previous method of random token removal for CLIP training. The approach also makes it efficient to apply multiple augmentation views to the image and to introduce instance contrastive learning tasks between these views into the CLIP framework. Compared to other CLIP improvements that combine different pre-training targets, such as SLIP and MaskCLIP, our method is not only more effective but also much more efficient. Specifically, using ViT-B and the YFCC-15M dataset, our approach achieves $43.9\%$ top-1 accuracy on ImageNet-1K zero-shot classification, as well as $62.7/42.1$ and $38.0/23.2$ I2T/T2I retrieval accuracy on Flickr30K and MS COCO, which are $+1.1\%$, $+5.5/+0.9$, and $+4.4/+1.3$ higher than the SLIP method, while being $2.30\times$ faster. An efficient version of our approach running $1.16\times$ faster than the plain CLIP model achieves significant gains of $+5.3\%$, $+11.3/+8.0$, and $+9.5/+4.9$ on these benchmarks.
translated by Google Translate
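The attentive token removal described in the first abstract above can be illustrated with a minimal sketch. It assumes that a per-token relevance score is already available (in the abstract these scores come from the EMA visual encoder; how they are computed is not specified there, so the numbers below are made up). The function names `attentive_keep_indices` and `random_keep_indices` and the `keep_ratio` parameter are hypothetical and are not taken from the paper.

```python
import random


def attentive_keep_indices(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in original order.

    `scores` is assumed to hold one text-relevance score per image token.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices from most to least relevant.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top tokens but restore their spatial order.
    return sorted(ranked[:num_keep])


def random_keep_indices(num_tokens, keep_ratio, rng):
    """Baseline for comparison: keep a uniformly random subset of tokens."""
    num_keep = max(1, round(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), num_keep))


# Illustrative scores for 8 image tokens (invented values).
scores = [0.05, 0.30, 0.02, 0.25, 0.08, 0.20, 0.04, 0.06]
kept = attentive_keep_indices(scores, keep_ratio=0.5)
baseline = random_keep_indices(len(scores), keep_ratio=0.5, rng=random.Random(0))
```

With these scores, the attentive variant always keeps the four tokens with the largest scores, whereas the random baseline may drop them, which is the failure mode the abstract hypothesizes.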
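In this paper, we address one-shot unsupervised domain adaptation (OSUDA) for semantic segmentation, where the segmenter sees only one unlabeled target image during training. In this setting, conventional unsupervised domain adaptation models usually fail because they cannot adapt to the target domain without overfitting to one (or a few) target samples. To tackle this problem, existing OSUDA methods usually integrate a style-transfer module that performs domain randomization based on the unlabeled target sample, so that multiple domains around the target sample can be explored during training. However, such a style-transfer module relies on an additional set of images as style references for pre-training, and it also increases the memory demand of domain adaptation. Here we propose a new OSUDA method that can effectively relieve this computational burden. Specifically, we integrate several style-mixing layers into the segmenter, which play the role of the style-transfer module and stylize the source images without introducing any learned parameters. Moreover, we propose a patchwise prototypical matching (PPM) method to weight the importance of source pixels during supervised training and thereby alleviate negative adaptation. Experimental results show that our method achieves new state-of-the-art performance on two commonly used benchmarks under the one-shot setting and is more efficient than all compared methods.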
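Due to the curse of dimensionality and the limited availability of training data, approximating high-dimensional functions is a very challenging task even for powerful deep neural networks. Inspired by the Nonlinear Level set Learning (NLL) method, which uses the reversible residual network (RevNet), this paper proposes a method of Dimension Reduction via Learning Level Sets (DRiLLS) for function approximation. Our method contains two major components: a pseudo-reversible neural network (PRNN) module that effectively transforms high-dimensional input variables into low-dimensional active variables, and a synthesized regression module that approximates function values based on the transformed data in the low-dimensional space. The PRNN not only relaxes the invertibility constraint on the nonlinear transformation that the NLL method inherits from its use of RevNet, but also adaptively weights the influence of each sample and controls the sensitivity of the function to the learned active variables. The synthesized regression uses the Euclidean distance in the input space to select neighboring samples, whose projections onto the space of active variables are used to perform local least-squares polynomial fitting. This helps to resolve the numerical oscillation issues present in traditional local and global regressions. Extensive experimental results demonstrate that our DRiLLS method outperforms both the NLL and Active Subspace methods, especially when the target function possesses critical points in the interior of its input domain.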
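The regression step described in the abstract above (neighbors selected by Euclidean distance in the input space, then a local least-squares fit in the active variable) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a single active variable, uses a degree-1 fit instead of a general polynomial, and replaces the learned PRNN transform with a hand-written stand-in `to_active`. All function and variable names are hypothetical.

```python
import math


def local_linear_predict(query, samples, values, to_active, k):
    """Predict f(query) from the k nearest samples with a local linear fit."""
    # 1) Choose neighbors by Euclidean distance in the ORIGINAL input space.
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    nearest = order[:k]
    # 2) Project the neighbors onto the (assumed one-dimensional) active variable.
    z = [to_active(samples[i]) for i in nearest]
    y = [values[i] for i in nearest]
    # 3) Closed-form degree-1 least-squares fit of y against z.
    n = len(nearest)
    z_mean = sum(z) / n
    y_mean = sum(y) / n
    spread = sum((zi - z_mean) ** 2 for zi in z)
    if spread == 0.0:
        # All neighbors share one active value; fall back to their mean.
        return y_mean
    slope = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y)) / spread
    return y_mean + slope * (to_active(query) - z_mean)


def to_active(x):
    """Stand-in for the learned transform: the function varies only along x0 + x1."""
    return x[0] + x[1]


# Toy data on a 5 x 5 grid; the target depends on the inputs only through to_active.
samples = [(i * 0.25, j * 0.25) for i in range(5) for j in range(5)]
values = [2.0 * to_active(s) + 1.0 for s in samples]
prediction = local_linear_predict((0.4, 0.6), samples, values, to_active, k=6)
```

Because the toy target is exactly linear in the active variable, the local fit reproduces it up to floating-point error; a higher polynomial degree would be needed for curved targets.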
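Fine-grained image recognition is challenging because discriminative clues are usually fragmented, whether they come from a single image or from multiple images. Despite significant improvements, most existing methods still focus on the most discriminative parts of a single image, ignoring informative details in other regions and neglecting clues from other related images. In this paper, we analyze the difficulties of fine-grained image recognition from a new perspective and propose a transformer architecture with a peak suppression module and a knowledge guidance module, which respects the diversification of discriminative features within a single image and the aggregation of discriminative clues across multiple images. Specifically, the peak suppression module first uses a linear projection to convert the input image into sequential tokens. It then blocks tokens based on the attention response generated by the transformer encoder. This module penalizes attention to the most discriminative parts during feature learning and thereby improves the use of information from the neglected regions. The knowledge guidance module compares the image-based representation generated by the peak suppression module with a learnable knowledge embedding set to obtain knowledge response coefficients. Knowledge learning is then formalized as a classification problem that uses the response coefficients as classification scores. The knowledge embeddings and the image-based representations are updated during training so that the knowledge embeddings include discriminative clues from different images. Finally, we incorporate the acquired knowledge embeddings into the image-based representations to form comprehensive representations, leading to significantly higher performance. Extensive evaluations on six popular datasets demonstrate the advantage of the proposed method.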
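Partial differential equations are often used to model various physical phenomena, such as heat diffusion, wave propagation, fluid dynamics, elasticity, electrodynamics, and image processing, and many analytic approaches and traditional numerical methods have been developed and widely used to solve them. Inspired by the rapidly growing impact of deep learning on scientific and engineering research, in this paper we propose a novel neural network, GF-Net, for learning the Green's functions of linear reaction-diffusion equations in an unsupervised fashion. The proposed method overcomes the challenge of finding the Green's functions of the equations on arbitrary domains by using a physics-informed approach together with the symmetry of the Green's function. As a consequence, it leads to an efficient way of solving the target equations under different boundary conditions and sources. We also demonstrate the effectiveness of the proposed approach through experiments on square, annular, and L-shaped domains.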
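To make concrete why a learned Green's function gives solutions for many sources and boundary conditions at once, the following worked formula may help. It is a sketch under simplifying assumptions that are not stated in the abstract above: a constant unit diffusion coefficient, a reaction coefficient $r(x) \ge 0$, and Dirichlet boundary data $g$ on a bounded domain $\Omega$. The symbols $r$, $f$, $g$, $G$, and $\xi$ are notation chosen here for illustration.

```latex
% Model problem (assumed form): linear reaction-diffusion with Dirichlet data
\begin{aligned}
  -\Delta u(x) + r(x)\,u(x) &= f(x), && x \in \Omega, \\
  u(x) &= g(x), && x \in \partial\Omega .
\end{aligned}

% Green's function for the same operator with homogeneous boundary values
\begin{aligned}
  -\Delta_x G(x,\xi) + r(x)\,G(x,\xi) &= \delta(x-\xi), && x \in \Omega, \\
  G(x,\xi) &= 0, && x \in \partial\Omega, \\
  G(x,\xi) &= G(\xi,x) && \text{(symmetry, since the operator is self-adjoint).}
\end{aligned}

% Representation of the solution, obtained from Green's second identity
u(x) \;=\; \int_{\Omega} G(x,\xi)\, f(\xi)\, \mathrm{d}\xi
      \;-\; \int_{\partial\Omega} \frac{\partial G(x,\xi)}{\partial n_{\xi}}\; g(\xi)\, \mathrm{d}S_{\xi} .
```

Once $G$ is available on a given domain, changing the source $f$ or the boundary data $g$ only requires re-evaluating these two integrals rather than solving the equation again.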
Lack of factual correctness is an issue that still plagues state-of-the-art summarization systems despite their impressive progress on generating seemingly fluent summaries. In this paper, we show that factual inconsistency can be caused by irrelevant parts of the input text, which act as confounders. To that end, we leverage information-theoretic measures of causal effects to quantify the amount of confounding and to characterize precisely how it affects summarization performance. Based on insights derived from our theoretical results, we design a simple multi-task model to control such confounding by leveraging human-annotated relevant sentences when available. Crucially, we give a principled characterization of data distributions where such confounding can be large, thereby necessitating the use of human-annotated relevant sentences to generate factual summaries. Our approach improves faithfulness scores by 20\% over strong baselines on AnswerSumm \citep{fabbri2021answersumm}, a conversation summarization dataset where lack of faithfulness is a significant issue due to the subjective nature of the task. Our best method achieves the highest faithfulness score while also achieving state-of-the-art results on standard metrics like ROUGE and METEOR. We corroborate these improvements through human evaluation.
Most recent head pose estimation (HPE) methods are dominated by the Euler angle representation. To avoid its inherent ambiguity in rotation labels, alternative quaternion-based and vector-based representations have been introduced. However, neither is visually intuitive, and both are often derived from equivocal Euler angle labels. In this paper, we present a novel single-stage keypoint-based method via an {\it intuitive} and {\it unconstrained} 2D cube representation for joint head detection and pose estimation. The 2D cube is an orthogonal projection of the 3D regular hexahedron label roughly surrounding one head, and it itself contains the head location. It can reflect the head orientation straightforwardly and unambiguously at any rotation angle. Unlike general 6-DoF object pose estimation, our 2D cube ignores the 3-DoF of head size but retains the 3-DoF of head pose. Based on the prior of equal side length, we can obtain a closed-form solution for the Euler angles from the predicted 2D head cube instead of applying the error-prone PnP algorithm. In experiments, our proposed method achieves results comparable to those of other representative methods on the public AFLW2000 and BIWI datasets. In addition, a novel test on the CMU Panoptic dataset shows that our method can be seamlessly adapted to the unconstrained full-view HPE task without modification.
While Named Entity Recognition (NER) is a widely studied task, inferring entities from only a few labeled examples has been challenging, especially for entities with nested structures. Unlike flat entities, entities and their nested entities are more likely to have similar semantic feature representations, which drastically increases the difficulty of classifying different entity categories in the few-shot setting. Although prior work has briefly discussed nested structures in the context of few-shot learning, to the best of our knowledge, this paper is the first specifically dedicated to studying the few-shot nested NER task. Leveraging contextual dependency to distinguish nested entities, we propose a Biaffine-based Contrastive Learning (BCL) framework. We first design a Biaffine span representation module that learns the contextual span dependency representation for each entity span rather than only its semantic representation. We then merge these two representations through a residual connection to distinguish nested entities. Finally, we build a contrastive learning framework to adjust the representation distribution for larger margin boundaries and more generalized domain transfer learning ability. We conducted experimental studies on three nested NER datasets in English, German, and Russian. The results show that BCL outperformed three baseline models on the 1-shot and 5-shot tasks in terms of F1 score.
Each student matters, but it is hard for instructors to observe all the students during a course and to provide immediate help to those who need it. In this paper, we present StuArt, a novel automatic system designed for individualized classroom observation, which enables instructors to attend to the learning status of each student. StuArt can recognize five representative student behaviors (hand-raising, standing, sleeping, yawning, and smiling) that are highly related to engagement, and it tracks their variation trends during the course. To protect the privacy of students, all the variation trends are indexed by seat number without any personally identifying information. Furthermore, StuArt adopts various user-friendly visualization designs to help instructors quickly understand the learning status of individual students and of the whole class. Experimental results on real classroom videos demonstrate the superiority and robustness of the embedded algorithms. We expect our system to promote the development of large-scale individualized guidance of students.
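Open-domain dialogue systems aim to interact with humans through natural language text in an open-ended fashion. However, the widely successful neural networks may not work well for dialogue systems, because they tend to generate generic responses. In this work, we propose an Equal-size Hard Expectation-Maximization (EqHard-EM) algorithm to train a multi-decoder model for diverse dialogue generation. Our algorithm assigns each sample to a decoder in a hard manner and additionally imposes an equal-assignment constraint to ensure that all decoders are well trained. We provide a detailed theoretical analysis to justify our approach. In addition, experiments on two large-scale open-domain dialogue datasets verify that our EqHard-EM algorithm generates high-quality, diverse responses.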
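The idea of a hard assignment with an equal-size constraint, as described in the abstract above, can be illustrated with a small sketch. The abstract does not say how the constrained assignment is solved, so the greedy procedure below is an assumption made for illustration only and need not find the minimum-cost balanced assignment. The names `plain_hard_assign` and `equal_size_hard_assign`, and the loss values, are hypothetical.

```python
def plain_hard_assign(losses):
    """Unconstrained hard assignment: each sample picks its lowest-loss decoder."""
    return [min(range(len(row)), key=lambda k: row[k]) for row in losses]


def equal_size_hard_assign(losses):
    """Greedy hard assignment in which every decoder receives the same number of samples.

    `losses[i][k]` is assumed to be the loss of decoder k on sample i.
    """
    num_samples = len(losses)
    num_decoders = len(losses[0])
    if num_samples % num_decoders != 0:
        raise ValueError("number of samples must be divisible by number of decoders")
    capacity = num_samples // num_decoders
    # Visit (sample, decoder) pairs from lowest to highest loss.
    pairs = sorted(
        (losses[i][k], i, k) for i in range(num_samples) for k in range(num_decoders)
    )
    assignment = [None] * num_samples
    load = [0] * num_decoders
    for _, i, k in pairs:
        # Accept the pair only if the sample is free and the decoder has room.
        if assignment[i] is None and load[k] < capacity:
            assignment[i] = k
            load[k] += 1
    return assignment


# Four samples, two decoders (invented loss values).
losses = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.4], [0.6, 0.5]]
unconstrained = plain_hard_assign(losses)
balanced = equal_size_hard_assign(losses)
```

In this toy case the unconstrained assignment sends three of the four samples to the first decoder, while the balanced version gives each decoder two samples, which is the effect the equal-assignment constraint is meant to achieve.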